-Darshit Doshi and Jared Kohler
About Safegraph: "We obtain a variety of information (collectively the “Information”) from trusted third-party data partners such as mobile application developers. We collect this Information primarily through APIs, which are interfaces through which these app developers can provide us with information about their users. We sometimes collect the Information through other delivery methods, such as software development kits (“SDKs”) that are embedded directly into mobile apps."
The dataset we downloaded is composed of several related tables with information about cafes and coffee shops in San Francisco, CA.
The first of these tables is the Core table, containing site information about each of the businesses contained in the dataset.
import pandas as pd
import fiona, os, shapely, json, copy
from shapely.geometry import shape, mapping
import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame
from IPython.display import Image
core = pd.read_csv('Data/CA-722515-CORE_POI-2019-09-27.csv')
core.T
The main information from this table that is useful to our analysis is the open_hours column.
core.open_hours.loc[2]
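The open_hours field is a JSON string mapping day abbreviations to a list of [open, close] time pairs. A minimal parsing sketch, using a hand-made value assumed to match SafeGraph's format (the sample string below is hypothetical, not taken from the dataset):

```python
import json

# Hypothetical open_hours value in the assumed SafeGraph format:
# a JSON object mapping day abbreviations to [open, close] pairs.
sample = '{"Mon": [["7:00", "19:00"]], "Wed": [["6:30", "20:00"]], "Sun": []}'

hours = json.loads(sample)
wed_open, wed_close = hours['Wed'][0]   # first (and usually only) interval
print(wed_open, wed_close)              # 6:30 20:00
```

An empty list for a day (here Sunday) indicates the business is closed that day, which the cleaning functions below must handle.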
The table loaded below includes foot-traffic data for a one-month period between August and September of 2019.
patterns = pd.read_csv('Data/CA-722515-PATTERNS-2019_08-2019-09-27.csv')
patterns.head()
patterns.T
Initial cleaning removes data columns that are either not relevant to the current analysis or are too sparsely recorded to be of use.
clean_patterns = patterns.drop(columns=['city', 'state', 'date_range_start', 'date_range_end', 'visitor_home_cbgs', 'visitor_work_cbgs', 'visitor_country_of_origin', 'related_same_day_brand', 'related_same_month_brand', 'device_type'])
clean_patterns_core = pd.merge(clean_patterns, core[['safegraph_place_id', 'open_hours']], on='safegraph_place_id', how='left')
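A left merge silently leaves NaN in open_hours wherever no Core row matches; pandas' indicator flag gives a quick coverage check. A sketch on toy frames (the column names mirror the real tables, but the data here is illustrative):

```python
import pandas as pd

# Toy stand-ins for the patterns and core tables (illustrative data only).
patterns_demo = pd.DataFrame({'safegraph_place_id': ['a', 'b', 'c']})
core_demo = pd.DataFrame({'safegraph_place_id': ['a', 'b'],
                          'open_hours': ['{"Wed": [["7:00", "19:00"]]}'] * 2})

merged = pd.merge(patterns_demo, core_demo, on='safegraph_place_id',
                  how='left', indicator=True)
# Rows tagged 'left_only' have no Core match and carry NaN open_hours.
print((merged['_merge'] == 'left_only').sum())  # 1
```

The null-percentage check performed later on open_hours measures exactly this kind of unmatched or missing coverage.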
Dropping the night club 'The Cafe', which appears to have been incorrectly included in the dataset of cafes.
clean_patterns_core = clean_patterns_core.drop(0)
Storing this data to a CSV file for Tableau analysis.
clean_patterns_core.to_csv('clean_patterns_core.csv')
Main dataframe
clean_patterns_core.T
There are 534 cafes in San Francisco in total, of which 18.2% are chain cafes such as Starbucks and Jamba; the rest are non-chain cafes.
Image(filename='Basic_analysis.JPG', width=600)
Further analyzing the chain cafes, we see that approximately 70% of them are Starbucks.
Image(filename='Brand_Distribution.png', width=600)
Plotting the geographic extent of our data (all cafes) on a map using Tableau and the shapefiles included with the dataset (used again later in the analysis).
Image(filename='geographic_extent.png', width=1000)
Plotting only the chain brands on the map to analyze how they are located throughout the city. We found that 90% of the chain stores are located within a small radius of the downtown district.
Image(filename='geographic_extent_only_chain.png', width=1000)
The dataset includes foot-traffic counts for each business during each hour of the day. This data is stored in list objects with 24 values, one for each hour. Here we extract it and clean it to show information only for the hours when the business is open.
popularity_hour = clean_patterns_core[['safegraph_place_id', 'location_name', 'brands', 'popularity_by_hour', 'open_hours']]
Checking the percentage of null values
len(popularity_hour[popularity_hour.open_hours.isna()])/len(popularity_hour)
Only considering the non-null values
popularity_hour = popularity_hour[pd.notna(popularity_hour['open_hours'])]
Creating functions for extracting the start and end time
def clean_starttime(x):
    if x:
        return int(x[0][0].split(':')[0])
    else:
        return float('NaN')

def clean_endtime(x):
    if x:
        return int(x[0][1].split(':')[0])
    else:
        return float('NaN')
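To see what these helpers do, here they are applied to a hand-made Wednesday entry (the input string below is hypothetical, shaped like the assumed open_hours format):

```python
import json
import math

def clean_starttime(x):
    # x is the list of [open, close] pairs for one day; empty means closed.
    if x:
        return int(x[0][0].split(':')[0])
    return float('NaN')

def clean_endtime(x):
    if x:
        return int(x[0][1].split(':')[0])
    return float('NaN')

# Hypothetical Wednesday entry in the assumed format.
wed = json.loads('{"Wed": [["6:30", "20:00"]]}')['Wed']
print(clean_starttime(wed), clean_endtime(wed))  # 6 20

# A day with no intervals yields NaN, which is filtered out with notna() below.
print(math.isnan(clean_starttime([])))  # True
```

Note that only the hour component is kept (minutes are truncated), which is sufficient for bucketing traffic by hour.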
For the sake of analysis, Wednesday will be used to represent a typical day. The open hours for Wednesday will be used for visualization purposes.
popularity_hour['start_time']=popularity_hour.open_hours.apply(lambda x : clean_starttime(json.loads(x)['Wed']))
popularity_hour['end_time']=popularity_hour.open_hours.apply(lambda x : clean_endtime(json.loads(x)['Wed']))
popularity_hour = popularity_hour[popularity_hour.start_time.notna()]
popularity_hour.head().T
A plot showing the number of businesses opening at each hour of the day.
ph = popularity_hour.start_time.plot.hist(bins=24, width=1)
ph.set_xlabel('Hour of day')
ph.set_ylabel('Count of cafes')
ph.set_title('Common start time hours')
A similar plot made in Tableau showing the number of cafes opening at each hour of the day. The most common time for cafes to open is 6:00-8:00 AM.
popularity_hour.to_csv('start_end_time.csv')
Image(filename='start_time.JPG', width=500)
A plot made in Tableau showing the number of cafes closing at each hour of the day. The most common time for cafes to close is 6:00-8:00 PM.
Image(filename='end_time.JPG', width=500)
popularity_hour_list = popularity_hour.values.tolist()
for x1 in popularity_hour_list:
    # Zero out hours outside the open window (indices 5 and 6 hold start/end times).
    hourly = [int(v) for v in x1[3].strip('][').split(",")]
    x1[3] = [v if x1[5] < (i + 1) < x1[6] else 0 for i, v in enumerate(hourly)]
popularity_hour_dataframe = pd.DataFrame(popularity_hour_list)
popularity_hour_dataframe = popularity_hour_dataframe.drop(columns=[4,5,6])
popularity_hour_dataframe.head()
popularity_hour_dataframe = popularity_hour_dataframe.rename(columns={0: "safegraph_", 1: "Name", 2:"Brands", 3: "Hours"})
for i in range(24):
    popularity_hour_dataframe[str(i+1)] = popularity_hour_dataframe['Hours'].apply(lambda x: x[i])
popularity_hour_dataframe = popularity_hour_dataframe.drop(columns=['Hours'])
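The loop-and-apply expansion above works; an equivalent vectorized route is to build all 24 columns at once from the list column. A sketch on toy data (column names mirror those above, values are illustrative):

```python
import pandas as pd

# Toy frame with parsed 24-value hourly lists (illustrative values only).
df = pd.DataFrame({'Name': ['Cafe A', 'Cafe B'],
                   'Hours': [[0, 3, 5] + [0] * 21, [1] * 24]})

# Expand each list into its own set of 24 columns, labelled '1'..'24'.
hour_cols = pd.DataFrame(df['Hours'].tolist(), index=df.index,
                         columns=[str(i + 1) for i in range(24)])
wide = pd.concat([df.drop(columns=['Hours']), hour_cols], axis=1)
print(wide.shape)  # (2, 25)
```

This avoids 24 separate apply passes over the frame, which matters more as the table grows.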
A simplified version of the traffic data. Now there is a column for each hour of the day with a value for each business.
popularity_hour_dataframe.head()
The third and final component of the dataset is a set of shapefiles showing the outline of each business. Instead of the outlines, centroids will be more useful, allowing for a bubble map of locations.
with fiona.open('Data/CA-722515-GEOMETRY-2019-09-27.shp') as src:
    meta = src.meta
    meta['schema']['geometry'] = 'Point'
    with fiona.open('Data/centroids.shp', 'w', **meta) as dst:
        for f in src:
            centroid = shape(f['geometry']).centroid
            f['geometry'] = mapping(centroid)
            dst.write(f)
centroids= gpd.read_file('Data/centroids.shp')
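The same centroid extraction can also be done entirely in geopandas, since a GeoSeries exposes a .centroid accessor. A sketch on a synthetic footprint (the unit square below is made up; the shapefile path used in the main flow is the real source):

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A made-up square standing in for one business footprint polygon.
gdf = gpd.GeoDataFrame({'safegraph_': ['demo']},
                       geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])])

# GeoSeries.centroid replaces each footprint polygon with its centroid point.
centroids_demo = gdf.copy()
centroids_demo['geometry'] = gdf.geometry.centroid
pt = centroids_demo.geometry.iloc[0]
print(pt.x, pt.y)  # 1.0 1.0
```

The fiona route above was kept in the main flow because it writes the intermediate shapefile to disk; the geopandas route keeps everything in memory.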
The centroid values are merged with the hourly traffic data, producing a product nearly ready to map.
centroid_hours = centroids.merge(popularity_hour_dataframe, on='safegraph_')
A GeoJSON file will serve well for mapping purposes.
centroid_hours.to_file("centroid_hours3.geojson", driver='GeoJSON')
This screen capture shows the locations of the cafes as small blue dots. The hour represented is midnight, so almost all cafes are closed and show no traffic.
Image(filename='Midnight.png', width=1000)
In contrast, this screen capture represents mid-day. Now nearly all cafes are open and are colored and sized relative to their number of visitors.
Image(filename='Midday.png', width=1000)
Analyzing the traffic near the cafes when they are closed.
popularity_hour.head()
popularity_hour_closed_list = popularity_hour.values.tolist()
for x1 in popularity_hour_closed_list:
    # Keep only hours outside the open window (the complement of the analysis above).
    hourly = [int(v) for v in x1[3].strip('][').split(",")]
    x1[3] = [v if (i + 1) < x1[5] or (i + 1) > x1[6] else 0 for i, v in enumerate(hourly)]
popularity_hour_closed_dataframe = pd.DataFrame(popularity_hour_closed_list)
popularity_hour_closed_dataframe = popularity_hour_closed_dataframe.drop(columns=[4,5,6])
popularity_hour_closed_dataframe.head()
popularity_hour_closed_dataframe = popularity_hour_closed_dataframe.rename(columns={0: "safegraph_", 1: "Name", 2: "Brands", 3: "Hours"})
for i in range(24):
    popularity_hour_closed_dataframe[str(i+1)] = popularity_hour_closed_dataframe['Hours'].apply(lambda x: x[i])
popularity_hour_closed_dataframe = popularity_hour_closed_dataframe.drop(columns=['Hours'])
popularity_hour_closed_dataframe.head()
centroid_closed_hours = centroids.merge(popularity_hour_closed_dataframe, on='safegraph_')
centroid_closed_hours.to_file("centroid_closed_hours3.geojson", driver='GeoJSON')
The data collection method is picking up traffic (represented in red) at most of the locations, even when they are shown to be closed. This suggests that in some fashion there is still foot traffic near these locations, potentially traffic that could be taken advantage of for business purposes.
A live version of an animated visualization can be seen at the link at the end of this document. Detailed exploration of this tool could reveal numerous potential sites.
Image(filename='1am.png', width=800)
Image(filename='2pm.png', width=800)
By viewing the data in this way, it is possible to identify potential locations where traffic is high even though shops are closed. One location that demonstrates this idea is near the Mission and 8th St intersection. A bakery in the area shows high traffic during all of its open hours, but it closes particularly early, at approximately 4 PM. Even after it closes, this location and the surrounding locations show steady rates of foot traffic, suggesting untapped potential in this area for further business development, particularly in the later afternoon and evening hours.
Image(filename='Copy of Project 1 Final Presentation.png', width=1000)
All median customer dwell times in chains top out at 90 minutes. Assuming that chain stores have done extensive market research on profitability, the 20-40 minute range appears to be the ideal customer dwell time.
Image(filename='Dwell_Chain.png', width=800)
Non-chain stores have dwell times clustered at 20 and 40 minutes, but also show many instances of longer median dwell times. These long-dwell stores also seem to have very small numbers of customers, suggesting that they are not thriving to the extent of the stores with faster customer turnover.
Image(filename='Dwell2.png', width=800)
If a shop is committed to facilitating long customer experiences, one potential approach would be to follow the example of a Brooklyn shop that charges customers not by the item, but by the minute beyond one hour (after a $6 base fee).
https://www.businessinsider.com/coffee-shop-charges-for-time-spent-2016-10
Image(filename='Annotation 2019-10-16 143234.png', width=1000)
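As a back-of-the-envelope illustration of that pricing model: the article specifies only the $6 base fee and the one-hour threshold, so the per-minute rate below is an invented parameter, not the shop's actual rate.

```python
def time_based_price(minutes, base_fee=6.0, free_minutes=60, per_minute=0.10):
    # The $6 base fee covers the first hour; beyond that, a per-minute charge
    # applies. per_minute is an assumed value; the source does not state it.
    extra = max(0, minutes - free_minutes)
    return base_fee + extra * per_minute

print(time_based_price(40))  # 6.0 -- within the first hour, base fee only
print(time_based_price(90))  # 9.0 -- 30 extra minutes at the assumed $0.10
```

Under this model, a long-dwell customer pays roughly in proportion to the seat time they consume, rather than in proportion to the items they order.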
Although the data is interesting and may provide some business insight, it is necessary to also offer some caution about its use and reliability.
Image(filename='Jamba.png', width=1000)
The above chart shows the median customer dwell time for all Jamba Juice stores. There is a high level of consistency clustered at the 20 min mark. Oddly, there is one store which shows double the time. Looking up this store's location reveals something interesting.
Image(filename='Annotation 2019-10-15 170550.png', width=1000)
The store location is directly next to an escape room that advertises 60-minute customer experiences. If this store is indeed turning customers around in 20 minutes, it seems likely that the 40-minute median is being inflated by data from the store next door mixing with the Jamba location. Perhaps some of the customers shown by GPS to be at the Jamba location are actually at the adjacent business and staying longer. This is just one particularly clear example of many such potential issues in the data.
“There is always inherent noise in our visit attribution algorithm, due to jumpiness/errors in GPS data (such as GPS drift, GPS ping scattering in city environments)”
“Our visit data is collected from a panel of mobile GPS devices data. The likely reasons you would see visits outside of business hours are (1) we are picking up visits in the building which are not to the actual POI (for instance a condo over a Starbucks) or (2) we are picking up night workers at a location.”
This leads to questions about all of the data. When the data shows that there are many individuals at a location after regular business hours, are these actually potential customers, or is the collection method merely picking up the presence of people in a nearby apartment building, bus stop, or theater?
-Use the data to fine-tune the hours and operations of current businesses to follow best practices of competitors or take advantage of missed opportunities
-Always double-check before making a financial decision based on the data due to the bad signal to noise ratio in SafeGraph’s collection methods
Follow this link for a narrated, visual review of this analysis: https://www.youtube.com/watch?v=hN3JWIhM7eQ&feature=youtu.be